Load Library

The Real Problem

What is Pneumonia? Pneumonia is an infection in one or both lungs, caused by bacteria, viruses, or fungi. The infection causes inflammation in the air sacs of the lungs, which are called alveoli. Pneumonia accounts for over 15% of all deaths of children under 5 years old internationally. In 2017, 920,000 children under the age of 5 died from the disease.

Diagnosing pneumonia requires review of a chest radiograph (CXR) by highly trained specialists and confirmation through clinical history, vital signs, and laboratory exams. Pneumonia usually manifests as an area or areas of increased opacity on CXR. However, the diagnosis of pneumonia on CXR is complicated by a number of other conditions in the lungs, such as fluid overload (pulmonary edema), bleeding, volume loss (atelectasis or collapse), lung cancer, or post-radiation or surgical changes. Outside of the lungs, fluid in the pleural space (pleural effusion) also appears as increased opacity on CXR. When available, comparison of CXRs of the patient taken at different time points and correlation with clinical symptoms and history are helpful in making the diagnosis.

CXRs are the most commonly performed diagnostic imaging study. A number of factors, such as positioning of the patient and depth of inspiration, can alter the appearance of the CXR, complicating interpretation further. In addition, clinicians are faced with reading high volumes of images every shift.

Pneumonia Detection: To detect pneumonia, we need to detect inflammation of the lungs. In this project, you’re challenged to build an algorithm to detect a visual signal for pneumonia in medical images. Specifically, your algorithm needs to automatically locate lung opacities on chest radiographs.

Business Domain Value: Automating pneumonia screening in chest radiographs, and reporting the affected area through bounding boxes, can assist physicians in making better clinical decisions or even replace human judgement in certain functional areas of healthcare (e.g., radiology). Guided by relevant clinical questions, powerful AI techniques can unlock clinically relevant information hidden in the massive amount of data, which in turn can assist clinical decision making.

Problem Statement

The problem statement requires us to build a pneumonia detection system. The data consists of chest X-ray images of patients. The intent is to build an object detection model capable of (i) classifying an image as pneumonia or not pneumonia and (ii) localising the site of pneumonia within the lung for pneumonia images, with good accuracy.

Load data

Data info: The shape of the data was found to be (30227, 6), i.e. 30227 rows (patient image records) and 6 columns: patientId, x, y, width, height and Target. Of these, x, y, width and height are numerical attributes, and Target is a categorical attribute with two values: “0” denotes no disease (not pneumonia) and “1” denotes pneumonia.

The shape of the class_label data frame was found to be (30227, 2), i.e. 30227 rows and 2 columns: patientId and class label. The class label is a categorical attribute with three values: No Lung Opacity / Not Normal, Normal and Lung Opacity.
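
The checks described above can be sketched with pandas. This is a minimal illustration on a small synthetic table, not the notebook's original code: the values are made up, and only the column names come from the text.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the labels table; all values are illustrative.
labels = pd.DataFrame({
    "patientId": ["p1", "p2", "p2", "p3"],
    "x": [np.nan, 264.0, 562.0, np.nan],
    "y": [np.nan, 152.0, 152.0, np.nan],
    "width": [np.nan, 213.0, 256.0, np.nan],
    "height": [np.nan, 379.0, 453.0, np.nan],
    "Target": [0, 1, 1, 0],
})

n_rows, n_cols = labels.shape            # number of rows and columns
column_names = labels.columns.tolist()   # patientId, x, y, width, height, Target
target_values = sorted(labels["Target"].unique().tolist())

print(f"shape: ({n_rows}, {n_cols})")
print("columns:", column_names)
print("Target values:", target_values)
```

On the real data the same three calls would report the shape, the six column names, and the two Target values.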

EDA

Null Values

The bounding box coordinates (x, y, width and height) were null for all patients with target value “0” and were defined only for target “1”. Since “0” denotes no disease, no bounding box and hence no coordinates are expected for these rows. These nulls were therefore not treated as missing data, and the rows were not dropped.
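
One way to confirm that the nulls are structural rather than missing data is to check that they occur on exactly the Target “0” rows. The sketch below assumes a small synthetic table with the same column names; it is not the notebook's original code.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the labels table; all values are illustrative.
labels = pd.DataFrame({
    "patientId": ["p1", "p2", "p3"],
    "x": [np.nan, 264.0, np.nan],
    "y": [np.nan, 152.0, np.nan],
    "width": [np.nan, 213.0, np.nan],
    "height": [np.nan, 379.0, np.nan],
    "Target": [0, 1, 0],
})

box_cols = ["x", "y", "width", "height"]
null_counts = labels[box_cols].isna().sum()         # nulls per coordinate column
box_is_null = labels[box_cols].isna().all(axis=1)   # True where the whole box is absent

# The nulls are expected only if they line up exactly with Target == 0.
nulls_match_negatives = bool((box_is_null == (labels["Target"] == 0)).all())
print(null_counts.to_dict(), nulls_match_negatives)
```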

Check Target

• 68.39% - Out of 30227, 20672 are identified as negative for pneumonia
• 31.61% - Out of 30227, 9555 are identified as positive for pneumonia

This implies an imbalance in the target variable. While building the model, we need to account for this imbalance.
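
The percentages can be reproduced from the two counts. The sketch below also derives inverse-frequency class weights, which is one common way to account for imbalance during training; the text does not say which technique the notebook used, so the weighting is an assumption.

```python
import numpy as np
import pandas as pd

# Rebuild the Target column from the counts reported above.
target = pd.Series(np.repeat([0, 1], [20672, 9555]), name="Target")

counts = target.value_counts()
share_pct = (target.value_counts(normalize=True) * 100).round(2)

# Assumed remedy (not from the notebook): weight = n_samples / (n_classes * class_count).
class_weight = (len(target) / (counts.size * counts)).round(3).to_dict()

print(share_pct.to_dict())
print(class_weight)
```

A dictionary of this shape can typically be passed to a training routine that accepts per-class weights, so that errors on the rarer positive class cost more.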

Distribution of class

The patients' target values were further mapped to three class labels. It was observed that all patients with target value “1” had only one class label, “Lung Opacity”. The other two class labels were associated only with target “0”. This observation is in line with the fact that pneumonia on an X-ray image is diagnosed as lung opacity.

Class label                     Count    Share
No Lung Opacity / Not Normal    11821    39.1%
Lung Opacity                     9555    31.6%
Normal                           8851    29.3%
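
The relationship between Target and class label can be checked with a cross-tabulation. The sketch below uses a tiny synthetic frame; the column name `class` is an assumption about how the class label column is named.

```python
import pandas as pd

# Synthetic stand-in for the merged labels and class info; values are illustrative.
merged = pd.DataFrame({
    "Target": [0, 0, 1, 1, 0],
    "class": [
        "Normal",
        "No Lung Opacity / Not Normal",
        "Lung Opacity",
        "Lung Opacity",
        "Normal",
    ],
})

# Rows are class labels, columns are Target values, cells are row counts.
table = pd.crosstab(merged["class"], merged["Target"])

# Every class label that occurs together with Target == 1.
positive_classes = merged.loc[merged["Target"] == 1, "class"].unique().tolist()
print(table)
print(positive_classes)
```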

Check duplicate patient ID samples

Unique Patients: While checking for duplicate values, it was found that some patientId values were repeated. However, the bounding box coordinates of rows sharing a patientId were different, implying that some patients have more than one bounding box.

Bounding box

For each patient ID, check how many bounding boxes are present in the dataset.
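
A minimal sketch of this count, assuming the column names from the data description and a small synthetic table. Counting the non-null `x` values per patient gives 0 for negative patients, because their coordinates are null.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the labels table; all values are illustrative.
labels = pd.DataFrame({
    "patientId": ["p1", "p2", "p2", "p3", "p4"],
    "x": [np.nan, 264.0, 562.0, 100.0, np.nan],
    "Target": [0, 1, 1, 1, 0],
})

n_unique_patients = labels["patientId"].nunique()

# count() skips NaN, so each value is the number of boxes for that patient.
boxes_per_patient = labels.groupby("patientId")["x"].count()

# How many patients have 0, 1, 2, ... boxes.
distribution = boxes_per_patient.value_counts().sort_index()
print(n_unique_patients)
print(distribution.to_dict())
```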

Merge info and data
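
Because both tables repeat a patientId once per bounding box, a plain merge on patientId would multiply rows (a patient with two boxes would produce four rows). The sketch below is one possible way to avoid that, assuming each patient has a single class label; it is not necessarily how the notebook merged the tables, and the data is synthetic.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the two tables; all values are illustrative.
labels = pd.DataFrame({
    "patientId": ["p1", "p2", "p2"],
    "x": [np.nan, 264.0, 562.0],
    "Target": [0, 1, 1],
})
class_info = pd.DataFrame({
    "patientId": ["p1", "p2", "p2"],
    "class": ["Normal", "Lung Opacity", "Lung Opacity"],
})

# Keep one class row per patient so the merge cannot multiply box rows.
class_per_patient = class_info.drop_duplicates(subset="patientId")

# validate raises an error if the right-hand side still has duplicate keys.
merged = labels.merge(
    class_per_patient, on="patientId", how="left", validate="many_to_one"
)
print(merged)
```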

Getting the center position of each bounding box
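
A minimal sketch of the center computation, assuming that x and y give the top-left corner of the box (the text does not state this explicitly). The column names `x_center` and `y_center` are hypothetical.

```python
import pandas as pd

# Synthetic bounding boxes; all values are illustrative.
boxes = pd.DataFrame({
    "x": [264.0, 562.0],
    "y": [152.0, 152.0],
    "width": [213.0, 256.0],
    "height": [379.0, 453.0],
})

# Center = top-left corner plus half the box size along each axis.
boxes["x_center"] = boxes["x"] + boxes["width"] / 2
boxes["y_center"] = boxes["y"] + boxes["height"] / 2
print(boxes[["x_center", "y_center"]])
```

Plotting `x_center` against `y_center` for the positive rows would give the kind of location distribution described next.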

The figure below shows the distribution of locations where bounding boxes are most often observed in positive cases.

Distribution of ViewPosition

Overall

Posterior/Anterior (PA): In PA, the X-ray beam hits the posterior (back) part of the chest before the anterior (front) part. While the image is obtained, the patient is asked to stand with their chest against the film.

Anterior/Posterior (AP): At times it is not possible for radiographers to acquire a PA chest X-ray, usually because the patient is too unwell to stand. AP projection images are of lower quality than PA images, and heart size is exaggerated (cardiothoracic ratio approximately 50%).

Evidence of pneumonia

Patients with sickness: in 81.5% of cases, pneumonia is detected from the AP position. Since AP images are usually taken when the patient is too unwell to stand, we can assume that for most patients pneumonia was detected when there was a considerable amount of infection.

Creating bins to categorise the age feature into 4 groups:

In the sections below, we try to understand which age groups contain the most infected patients. Let's create 4 age buckets and look at the distribution and percentage of infected patients.

The majority of patients belong to the (25-50) and (50-75) age groups.
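
The bucketing can be sketched with `pd.cut`. The bin edges 0, 25, 50, 75 and 100 are an assumption inferred from the (25-50) and (50-75) groups mentioned above, and the column name `PatientAge` is hypothetical; the data is synthetic.

```python
import pandas as pd

# Synthetic patient table; all values are illustrative.
patients = pd.DataFrame({
    "PatientAge": [10, 30, 50, 60, 80, 45],
    "Target": [0, 1, 0, 1, 1, 0],
})

# Assumed edges; intervals are right-inclusive, so age 50 falls in "25-50".
bins = [0, 25, 50, 75, 100]
names = ["0-25", "25-50", "50-75", "75-100"]
patients["age_group"] = pd.cut(patients["PatientAge"], bins=bins, labels=names)

# Number of patients and percentage of infected patients per bucket.
counts = patients["age_group"].value_counts().sort_index()
infected_pct = patients.groupby("age_group", observed=True)["Target"].mean() * 100
print(counts.to_dict())
print(infected_pct.round(1).to_dict())
```

With these edges, any age above 100 would fall outside every bucket and become NaN, so the upper edge may need adjusting for the real data.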

Creating distribution of age with pneumonia evidence, by gender and count plot of gender

Observation: The majority of patients suffering from sickness are male. The age density plot shows:
• Male: high density in age group (25-80)
• Female: high density in age group (40-80)

There are more younger males who are positive compared to females.

Correlation

The chart below shows a heat map of correlations between attributes. It shows a fairly high correlation of 0.6 between width and height, meaning that wider bounding boxes also tend to be taller.
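
The matrix behind such a heat map can be sketched with `DataFrame.corr`, which computes pairwise Pearson correlations by default. The boxes below are synthetic, so the resulting coefficient is illustrative and is not the 0.6 reported for the real data.

```python
import pandas as pd

# Synthetic bounding boxes; all values are illustrative.
boxes = pd.DataFrame({
    "x": [264.0, 562.0, 323.0, 695.0, 288.0],
    "y": [152.0, 152.0, 577.0, 575.0, 322.0],
    "width": [213.0, 256.0, 160.0, 162.0, 94.0],
    "height": [379.0, 453.0, 104.0, 137.0, 135.0],
})

corr = boxes.corr()                            # Pearson correlation matrix
width_height = corr.loc["width", "height"]     # the cell discussed in the text
print(corr.round(2))
```

The matrix is symmetric with ones on the diagonal, so only the cells above or below the diagonal carry information.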

Analysing the DICOM images

DICOM images in the training data were visualised both with and without sickness.

With Sickness

Without Sickness

Get metadata of each patient

Model Building

Reading the image data and appending it to the training_data dataset

Reading Dataset again for modeling

Take sample dataset

A sample of the dataset is used because training the model on the full dataset would take too long to execute.
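
The text does not say how the sample was drawn. One reasonable option, shown here as an assumption, is a stratified sample that keeps the positive and negative shares of the full data. The data is synthetic, and the 50% fraction is arbitrary.

```python
import pandas as pd

# Synthetic stand-in for training_data: 70 negative and 30 positive rows.
training_data = pd.DataFrame({
    "patientId": [f"p{i}" for i in range(100)],
    "Target": [0] * 70 + [1] * 30,
})

# Draw the same fraction from each Target group so the class balance is preserved.
sample = training_data.groupby("Target").sample(frac=0.5, random_state=42)
sample_share = sample["Target"].value_counts(normalize=True)
print(len(sample), sample_share.to_dict())
```

Fixing `random_state` makes the sample reproducible between runs.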

Pre Processing the image

For each patient ID, we read the DICOM image and resize it to 128x128. We then create the labels and images arrays for modeling.
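
The resizing step can be sketched with NumPy alone. This version assumes square 8-bit images whose side is a multiple of 128, and shrinks them by averaging equal blocks. The notebook may have used a different resize routine, and the helper names `downsample` and `build_arrays` are hypothetical. In practice each input array would come from the DICOM file (for example via pydicom's `dcmread(path).pixel_array`); random arrays stand in for it here.

```python
import numpy as np

def downsample(image: np.ndarray, size: int = 128) -> np.ndarray:
    """Shrink a 2-D image to size x size by averaging equal blocks."""
    h, w = image.shape
    if h % size or w % size:
        raise ValueError("image sides must be multiples of the target size")
    fh, fw = h // size, w // size
    # Split rows and columns into blocks, then average within each block.
    return image.reshape(size, fh, size, fw).mean(axis=(1, 3))

def build_arrays(pixel_arrays, targets, size=128):
    """Stack resized images (scaled to [0, 1]) and their labels."""
    images = np.stack([downsample(p, size) for p in pixel_arrays])
    images = images.astype("float32") / 255.0   # assumes 8-bit pixel values
    images = images[..., np.newaxis]            # add a channel axis
    labels = np.asarray(targets, dtype="int64")
    return images, labels

# Random stand-ins for two 1024x1024 chest X-rays.
rng = np.random.default_rng(0)
fake_scans = [rng.integers(0, 256, size=(1024, 1024)) for _ in range(2)]
images, labels = build_arrays(fake_scans, targets=[0, 1])
print(images.shape, labels.shape)
```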